Meridian Design Doc 4: Spark Fraud Detection

See also the threat model and the proposal for retrieval attestations in Boost

(1) Trusted SPs

One of the main issues we face is whether SPs actually return the requested file in the response body, rather than just the headers, which include the signature chain. Specifically, an SP can create a valid signature chain that proves a relay happened between the Station operator and the SP, but then simply return an empty response body. The evaluation of the signature chain will still pass, even though no content was returned.

We can simplify the Spark PoC by only using SPs that we can trust to return the right file in the response body.

(2) Proof of data possession

This approach seems complex, but it could lean on existing Filecoin code for proving possession of data.

(3) Hash comparisons

The idea here is for the same CID to be checked by a handful of Spark jobs, and for the orchestrator to compare the hashes of the files they receive. If the hashes agree, we gain confidence that the file was actually returned to each client, which was then able to hash it.
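As a sketch, the orchestrator-side comparison could look like the following (the measurement shape and the majority-vote rule are assumptions for illustration, not part of the Spark design):

```python
from collections import Counter

def find_suspect_measurements(measurements):
    """Group retrieval measurements by CID and flag those whose content
    hash disagrees with the majority for that CID. Each measurement is a
    dict with hypothetical keys: 'cid', 'sp', 'content_hash'."""
    by_cid = {}
    for m in measurements:
        by_cid.setdefault(m["cid"], []).append(m)

    suspects = []
    for cid, group in by_cid.items():
        # Majority vote: the hash most clients report is assumed honest.
        majority_hash, _ = Counter(m["content_hash"] for m in group).most_common(1)[0]
        suspects.extend(m for m in group if m["content_hash"] != majority_hash)
    return suspects
```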

(4) Pre-computed retrieval proofs with time-lock encryption

Initial setup

  1. Let’s say we have a storage deal client that wants to set up retrieval checks at times t1, t2, … tN. This client has access to the raw content being stored.
  2. For each tX:
    • The client defines a window when the check can be performed, e.g. between 1pm and 2pm on 2023-07-31.
    • The client generates a unique nonce value and encrypts it using timelock encryption so that it cannot be decrypted before the window starts.
    • The client combines the nonce with the raw content and derives a constant-length proof in such a way that the proof is unique to this check and can be recreated only with access to the nonce and the entire content. (Example algorithm: SHA256(XOR(nonce, content)).)
    • The client encrypts this proof using timelock encryption so that it cannot be decrypted until one hour after the window ends. (The one-hour delay was chosen arbitrarily; we can tweak it.)
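The proof derivation can be sketched as follows (cycling the nonce to the content length is one possible reading of XOR(nonce, content), which the example algorithm leaves unspecified):

```python
import hashlib
from itertools import cycle

def derive_proof(nonce: bytes, content: bytes) -> bytes:
    """Derive a constant-length retrieval proof from a nonce and the full
    content, per the sketch SHA256(XOR(nonce, content)). The nonce is
    cycled to match the content length -- an assumption, since the doc
    does not specify how to XOR values of different lengths."""
    mixed = bytes(c ^ n for c, n in zip(content, cycle(nonce)))
    return hashlib.sha256(mixed).digest()
```

Note that recreating the proof requires both the nonce and every byte of the content, which is the property the scheme relies on.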

Single retrieval check

  1. SPARK client receives a retrieval job definition.
    • If the job comes from a centralised Orchestrator, the orchestrator can open the timelock vault to obtain the nonce.
    • If the job comes from a smart contract or a similar decentralised component, the client needs to open the timelock on its own.
  2. The client retrieves the given CID from the given storage provider and re-creates the proof using the nonce and the retrieved content.
  3. The client submits the proof alongside other retrieval data, e.g. the attestation token and telemetry.
  4. Later, after the timelock for the proof opens, the SPARK Fraud Detection service verifies whether the client did the job correctly:
    • Did the client submit the proof within the specified window? (In practice: before the time-locked vault containing the proof opened. This gives us some buffer to account for network delays, which is why the second timelock must open later than the window ends.)
    • Does the proof submitted by the client match the proof stored in the time-locked vault?
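A minimal sketch of the evaluator-side check, assuming hypothetical record shapes (the real service would also need to handle clock skew and missing submissions):

```python
from dataclasses import dataclass

@dataclass
class Submission:
    proof: bytes
    submitted_at: float  # unix timestamp recorded by the orchestrator

def verify_submission(sub: Submission, expected_proof: bytes,
                      proof_unlock_time: float) -> bool:
    """Fraud-detection check: the client must have submitted before the
    proof's timelock opened, and the submitted proof must match the one
    stored in the time-locked vault."""
    submitted_in_time = sub.submitted_at < proof_unlock_time
    return submitted_in_time and sub.proof == expected_proof
```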

Assumptions & attack vectors

This scheme assumes the party submitting time-locked proofs can be trusted.

Fortunately, it should be relatively easy to detect cheating: after both time-locked vaults open, and as long as the original content can still be retrieved, anybody can download the raw content, obtain the nonce, and check whether the time-locked proof was correct.

Then it’s a matter of setting up a staking/slashing scheme for parties funding retrieval checks and an incentive scheme for parties detecting & reporting fraud.

Simplified centralised version

  • At the beginning of each period (e.g. once a month), a trusted party with access to the raw data creates N pairs of (nonce, proof) and submits them to the SPARK Orchestrator/DB. This is done for every CID we want to check.
    • In the beginning, when SPARK is checking a small subset of the data stored in Filecoin, it may be feasible to run our own service that will download the data for each CID and pre-compute the proofs. The trick is that we run this infrequently, amortising the cost of downloading the data across many future checks. At any given time, this service needs to keep the data for a single CID only, so the storage requirements should stay reasonably low.
  • When a job is created to retrieve a CID, we take the next unused (nonce, proof) pair from the database and assign it to the job.
  • The Orchestrator sends the nonce to the SPARK client as part of the job definition.
  • The SPARK client computes the proof and submits it together with other job data.
  • Later, the fraud-detection service (the Meridian evaluator) compares the expected proof against the proof submitted by the client to decide whether the job was performed correctly.
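The trusted party’s pre-computation step might look like this (the derivation mirrors the SHA256(XOR(nonce, content)) sketch above, with the nonce cycled over the content; function and key names are illustrative):

```python
import hashlib
import os
from itertools import cycle

def precompute_pairs(content: bytes, n: int):
    """Pre-compute n (nonce, proof) pairs for one CID, to be stored in
    the orchestrator DB and consumed one per retrieval job. The content
    only needs to be held in memory while this function runs."""
    pairs = []
    for _ in range(n):
        nonce = os.urandom(32)  # unique per check
        mixed = bytes(c ^ k for c, k in zip(content, cycle(nonce)))
        pairs.append((nonce, hashlib.sha256(mixed).digest()))
    return pairs
```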

Storage requirements

Let’s say we want the network to check the retrievability of each CID every minute, we pre-calculate proofs once a month, and we need 200 bytes to store the nonce, the proof, and the foreign-key reference to the CID. That costs 30 days × 24 hours × 60 minutes × 200 bytes = 8,640,000 bytes, i.e. ~8.6 MB per CID per month.

To check the top one million CIDs, we will need ~8.6 TB (8,640 GB).

  • Fly.io charges $0.15 per GB of storage, so 8.6 TB will cost ~$1,296/month.
  • Amazon S3 charges $0.023 per GB, so 8.6 TB will cost ~$199/month.
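The arithmetic above as a quick back-of-envelope check (decimal GB; prices as quoted in this section):

```python
# Storage estimate for pre-computed retrieval proofs.
checks_per_month = 30 * 24 * 60            # one check per minute
bytes_per_check = 200                      # nonce + proof + CID foreign key
bytes_per_cid = checks_per_month * bytes_per_check  # 8,640,000 B ≈ 8.6 MB

total_gb = bytes_per_cid * 1_000_000 / 1e9 # one million CIDs → 8,640 GB
fly_io = total_gb * 0.15                   # Fly.io, $0.15/GB → ≈ $1,296/month
s3 = total_gb * 0.023                      # S3, $0.023/GB → ≈ $199/month
```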

Cost of fetching CID and computing the proof via AWS Lambda

  • Data transfer from internet to AWS Lambda is free (i.e. download CID content)
  • Data transfer from AWS Lambda to AWS S3 is free (i.e. store the proofs)
  • Let’s say we limit content size to 512 MB (a reasonable limit, considering Stations will run retrievals on consumer-level computers), the Lambda function keeps the content in memory, and it uses an additional 256 MB of RAM for its code & state. That gives us 768 MB of RAM used, which costs $0.0000375075/second. Let’s say it takes on average 5 seconds to download the CID content and another 8,640 seconds to compute the proofs (assuming 200 ms per single proof). That gives us 8,645 seconds, charged at $0.324, per CID per month.

To check the top one million CIDs, we will pay ~$324,000 per month 😟 And that assumes we can create the nonce and calculate the proof in under 200 ms, which is very optimistic for a 512 MB payload.

(5) Signatures in Boost - is this useful outside of Spark?